Abstract

We propose automatically learning probabilistic Hierarchical Task Networks
(pHTNs) in order to capture a user's preferences on plans, by observing only the
user's behavior. HTNs are a common choice of representation for a variety of
purposes in planning, including work on learning in planning. Our contributions
are (a) learning structure and (b) representing preferences. In contrast, prior
work employing HTNs considers learning method preconditions (instead of
structure) and representing domain physics or search control knowledge (rather
than preferences). Initially we will assume that the observed distribution of
plans is an accurate representation of user preference, and then generalize to
the situation where feasibility constraints frequently prevent the execution of
preferred plans. In order to learn a distribution on plans we adapt an
Expectation-Maximization (EM) technique from the discipline of (probabilistic)
grammar induction, taking the perspective of task reductions as productions in
a context-free grammar over primitive actions. To account for the difference
between the distributions of possible and preferred plans we subsequently
modify this core EM technique, in short, by rescaling its input.

Figure 1: Hierarchical task networks in a travel domain.



1. Introduction 

Application of learning techniques to planning is an area of long-standing
research interest. Most work in this area to date has, however, only considered
learning domain physics or search control. The relatively neglected alternative,
and the focus of this work, is learning preferences. It has long been understood
that users may have complex preferences on plans (c.f. [?]). An effective
representation for preferences (among other possible purposes) is a Hierarchical
Task Network (HTN). In addition to domain physics (in terms of primitive actions and
their preconditions and effects), the planner is provided with a set of tasks (non- 
primitives) and methods (schemas) for reducing each into a combination of prim- 
itives and sub-tasks (which must then be reduced in turn). A plan (sequence of 
primitive actions) is considered valid if and only if it (a) is executable and achieves 
every specified goal, and (b) can be produced by recursively reducing a specified 
task (the top-level task). 

For the example in Figure 1 the top level task is to travel (and the goal is to
arrive at some particular destination); acceptable methods reduce the travel task
to either Gobytrain or Gobybus. In contrast, the plan of hitch-hiking (modeled as
a single action), while executable and goal achieving, is not considered valid — 
the user in question loathes that mode of travel. In this way we can separately 
model physics and (boolean) preferences; to accommodate degree of preference 
(i.e., more than just accept/loathe) we attach probabilities to the methods reducing 
tasks (and equate probable with preferred), arriving at probabilistic Hierarchical 
Task Networks (pHTNs). 

While pHTNs can effectively model preferences, manual construction (i.e.,
preference elicitation) is complex, error prone, and costly. In this paper, we
focus on automatically learning pHTNs that capture user preferences, i.e.,
learning them by observing only user behavior. Our approach takes off from the
view of task networks as grammars [?]. First, as mentioned, we generalize by
considering pHTNs rather than HTNs (to accommodate degree of preference).^1 So
each task is associated with a distribution over its possible reduction schemas,
and probable plans are interpreted as preferred plans. Then we exploit the
connection between task reduction schemas and production rules (in grammars) by
adapting the considerable work on grammar induction [4, ?]. Specifically, we
view plans as sentences (primitive actions are seen as words) generated by a
target grammar, adapt an expectation-maximization (EM) algorithm for learning
that grammar (given a set of example plans/sentences), and interpret the result
as a pHTN modeling user preference.

Note that in the foregoing we have assumed that the distribution of example 
plans directly reflects user preference. Certainly preferred plans will be executed 
more often than non-preferred plans, but with equal certainty, reality often forces 
compromise. For example, a (poor) graduate student may very well prefer, in 
general, to travel by car, but will nonetheless be far more frequently observed 
traveling by foot. In other words, by observing the plans executed by the user we 
can relatively easily learn what the user usually does,^2 and so can predict their
behavior as long as feasibility constraints remain the same. It is a much trickier 
matter to infer what the user truly prefers to do, and it is this piece of knowledge 
that would allow predicting what the user will do in a novel (and improved) sit- 
uation. Towards this end, in the second part of the paper, we describe a novel, 
but intuitive, extension of the core EM learning technique that rescales the input 
distribution in order to undo possible filtering due to feasibility constraints. The 
idea is to automatically generate (presumably less preferred) alternatives (e.g. by 
an automated planner) to the user's observed behavior and use this additional in- 
formation to appropriately reweight the distribution on observed plans. 

In the following sections, we start by formally stating the problem of learning
probabilistic hierarchical task networks (pHTNs). Next, we discuss the relations
between probabilistic grammar induction and pHTN learning, and present an
algorithm that acquires pHTNs from example plan traces. The algorithm works
in two phases. The first phase hypothesizes a set of schemas that can cover the
training examples, and the second is an expectation maximization phase that
refines the probabilities associated with the schemas. We then evaluate our
approach against models of users, by comparing the distributions of observed and
predicted plans. Subsequently we consider possible obfuscation from feasibility
constraints and describe our rescaling technique in detail. We go on to
demonstrate its effectiveness against randomized models of feasibility
constraints. Finally we discuss related work and summarize our contributions.

^1 Of course pHTNs can be trivially converted to HTNs if desired, by simply
ignoring the learned weights (and if desired, to prevent overfitting perhaps,
removing particularly unlikely reductions by setting some threshold).

^2 A useful piece of knowledge in the plan recognition scenario.

2. Probabilistic Hierarchical Task Networks 

Definitions. A pHTN domain $\mathcal{H}$ is a 3-tuple,
$\mathcal{H} = \langle \mathcal{A}, \mathcal{T}, \mathcal{M} \rangle$, where
$\mathcal{A}$ is a set of primitive actions, $\mathcal{T}$ is a set of tasks
(non-primitives), and $\mathcal{M}$ is a set of methods (reduction schemas). A
pHTN problem $\mathcal{R}$ is a 3-tuple, $\mathcal{R} = \langle I, G, T \rangle$,
with $I$ the initial state, $G$ the goal, and $T \in \mathcal{T}$ the top level
task to be reduced. Each method $m \in \mathcal{M}$ is a $(k+2)$-tuple,
$\langle Z, \theta, m_1, m_2, \ldots, m_k \rangle$, where each $m_i$ is a task or
primitive and $\theta$ is the probability of reducing $Z$ by $m$: let
$\mathcal{M}(Z)$ denote all methods that can reduce $Z$, then
$\sum_{m \in \mathcal{M}(Z)} \theta(m) = 1$. Without loss of generality^3 we
restrict our attention to Chomsky normal form: each method decomposes a task
into either two tasks or one primitive. So for any method $m$, either
$m = \langle Z, \theta, X, Y \rangle$ (also written $Z \to XY, \theta$), with
$X, Y \in \mathcal{T}$, or $m = \langle Z, \theta, a \rangle$ (also written
$Z \to a, \theta$) with $a \in \mathcal{A}$. Table 1 provides an example of a
pHTN domain in Chomsky normal form modeling the Travel domain (see Figure 1), in
the hypothesis space of our learner (hence the meaningless task names). According
to the table, the user prefers traveling by train (80%) to traveling by bus (20%).

For primitives, we follow STRIPS semantics: each primitive action defines
a transition function on states, and from an initial state $I$ executing some
sequence $a_1, a_2, \ldots, a_k$ of primitives produces a sequence of states
$s_0 = I$, $s_1 = a_1(s_0)$, $s_2 = a_2(s_1)$, $\ldots$, $s_k = a_k(s_{k-1})$,
provided each $a_i$ has its preconditions satisfied in $s_{i-1}$. Such a sequence
is goal-achieving if the goal $G$ is satisfied in the final state, $s_k$ (goals
take the same form as preconditions).
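
As a minimal sketch of these semantics, the following Python fragment executes a sequence of primitives and checks goal achievement. The fact names (`at_origin`, `has_ticket`, and so on) and the preconditions and effects of the three travel actions are hypothetical, chosen only for illustration; the paper does not specify them.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # facts that must hold before execution
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def execute(state: FrozenSet[str], plan: Iterable[Action]) -> Optional[FrozenSet[str]]:
    """Return the final state s_k, or None if some precondition fails."""
    for a in plan:
        if not a.pre <= state:
            return None
        state = (state - a.delete) | a.add
    return state


def is_goal_achieving(init: FrozenSet[str], plan: Iterable[Action],
                      goal: FrozenSet[str]) -> bool:
    final = execute(init, plan)
    return final is not None and goal <= final


# Hypothetical encoding of the travel primitives (an assumption, not from the paper).
buy = Action("Buyticket", frozenset(), frozenset({"has_ticket"}), frozenset())
getin = Action("Getin", frozenset({"at_origin"}), frozenset({"on_board"}),
               frozenset({"at_origin"}))
getout = Action("Getout", frozenset({"on_board", "has_ticket"}),
                frozenset({"at_destination"}), frozenset({"on_board"}))
```

Under this encoding both orderings (ticket first, or ticket on board) are executable and goal-achieving, while leaving without a ticket is not executable.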

Concerning tasks, a primitive sequence $\phi$ is a preferred solution if there
exists a parse of $\phi$ by the methods of $\mathcal{H}$ with root $T$
(preferred), and $\phi$ is executable from $I$ and achieves $G$ (solution). A
parse $X$ of $\phi$ by $\mathcal{H}$ with root $T$ is a tree, more specifically
a rooted almost-binary directed ordered labeled tree, with (a) root labeled by
$T$, (b) leaves labeled by $\phi$ (in order), and (c) each internal vertex
decomposed into its children (in order) by some $m \in \mathcal{M}$. For such
internal vertices, say $v$, let $T(v)$, $m(v)$, and $\theta(v)$ be the associated
task, reducing method, and prior probability of that reduction. The prior
probability of an entire parse tree is the product of $\theta(v)$ over every
internal vertex $v$. Given a fixed root, the prior probability of a primitive
sequence is the sum of prior probabilities of every parse of that sequence with
the fixed root:

$$
\begin{aligned}
P(\phi \mid \mathcal{H}, T) &= \sum_{\text{parse } X} P(X \mid T, \mathcal{H}) && (1) \\
&= \sum_{\text{parse } X} \; \prod_{\text{internal } v} \theta(v); && (2)
\end{aligned}
$$

note that the prior probability of a primitive sequence has nothing to do with
whether or not the sequence is goal-achieving or even executable. Enumerating
all parses could, however, become expensive, so in the remainder we approximate
by considering only the most probable parse of $\phi$:

$$
P(\phi \mid \mathcal{H}, T) \approx \max_{\text{parse } X} \; \prod_{\text{internal } v} \theta(v). \qquad (3)
$$

^3 Any CFG can be put in Chomsky normal form by introducing sufficiently many
auxiliary non-primitives. This remains true for probabilistic context-free
grammars.
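
To make Equations 1 and 2 concrete, here is a small sketch that encodes the Travel pHTN of Table 1 as a Python dictionary and computes the prior probability of every primitive sequence by summing over all parses. The dictionary layout and the function name `plan_distribution` are illustrative assumptions, and the enumeration only terminates for non-recursive domains.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each task maps to a list of (probability, right-hand side) methods.
# In Chomsky normal form a right-hand side is two tasks or one primitive.
Methods = Dict[str, List[Tuple[float, Tuple[str, ...]]]]

TRAVEL: Methods = {
    "Travel": [(0.2, ("A2", "B1")), (0.8, ("A1", "B2"))],
    "B1": [(1.0, ("A1", "A3"))],
    "B2": [(1.0, ("A2", "A3"))],
    "A1": [(1.0, ("Buyticket",))],
    "A2": [(1.0, ("Getin",))],
    "A3": [(1.0, ("Getout",))],
}


def plan_distribution(methods: Methods, task: str) -> Dict[Tuple[str, ...], float]:
    """Prior probability of each primitive sequence derivable from `task`.

    Every parse contributes the product of its method probabilities, and
    contributions of different parses of the same sequence are summed.
    """
    dist: Dict[Tuple[str, ...], float] = defaultdict(float)
    for theta, rhs in methods[task]:
        if len(rhs) == 1:                       # Z -> a
            dist[rhs] += theta
        else:                                   # Z -> X Y
            left = plan_distribution(methods, rhs[0])
            right = plan_distribution(methods, rhs[1])
            for lp, p in left.items():
                for rp, q in right.items():
                    dist[lp + rp] += theta * p * q
    return dict(dist)
```

Calling `plan_distribution(TRAVEL, "Travel")` yields two plans, the train plan with probability 0.8 and the bus plan with probability 0.2; Hitchhike has no parse and so receives no probability.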

Learning pHTNs. We can now state the pHTN learning problem formally. Fix
the total number of task symbols, $k$, and fix the first task symbol as the top
level task $T$. Given a set $\Phi$ of observed training plans (so each is
executable and (presumably) goal-achieving), find the most likely pHTN domain,
$\mathcal{H}^*$. We assume a uniform prior distribution on domains with $k$ task
symbols, so it is equivalent to maximizing the likelihood of the observation:

$$
\begin{aligned}
\mathcal{H}^* &= \operatorname{argmax}_{\mathcal{H}} P(\mathcal{H} \mid \Phi, T) && (4) \\
&= \operatorname{argmax}_{\mathcal{H}} \frac{P(\Phi \mid \mathcal{H}, T)\, P(\mathcal{H} \mid T)}{P(\Phi \mid T)} \\
&= \operatorname{argmax}_{\mathcal{H}} P(\Phi \mid \mathcal{H}, T) && \text{($P(\mathcal{H} \mid T)$ uniform prior, $P(\Phi \mid T)$ constant)} \\
&= \operatorname{argmax}_{\mathcal{H}} \prod_{\phi \in \Phi} P(\phi \mid \mathcal{H}, T). && (5)
\end{aligned}
$$

Table 1: A probabilistic Hierarchical Task Network in Chomsky normal form.

Primitives: Buyticket, Getin, Getout, Hitchhike;
Tasks: Travel, A1, A2, A3, B1, B2;

Travel -> A2 B1, 0.2      Travel -> A1 B2, 0.8
B1 -> A1 A3, 1.0          B2 -> A2 A3, 1.0
A1 -> Buyticket, 1.0      A2 -> Getin, 1.0      A3 -> Getout, 1.0

Remark. The preceding incorporates several simplifying assumptions; most
importantly, we are making the connection to context-free grammars as strong as
possible. In particular our definitions do not permit conditions in the statement
of methods, so preferences such as "If in Europe, prefer trains to planes." are
not directly expressible (and so not learnable) in general. Our definitions
nominally permit parameterized actions, tasks, and methods, but there is no
mechanism for conditioning on parameters (e.g., varying the probability of a
reduction based on the value of a parameter), so it would seem that even
indirectly modeling conditional preference is impossible. This is both true and
false; if one is willing to entertain somewhat large values of $k$, then the
learning problem can work with a ground representation of a parameterized
domain, thereby gaining the ability to learn subtly different (or wildly
different) sub-grammars for distinct groundings of a parameterized task. Of
course, the difficulty of the learning task depends very strongly on $k$: in the
following we map terms such as "(buy ?customer ?vendor ?object ?location ?amount
?currency)" to symbols by truncation ("buy") rather than grounding
("buy_mike_joe_bat_walmart_3_dollars") for just this reason. Future work should
consider parameters, and contextual dependencies in general, in greater depth,
perhaps by taking the perspective of feature selection (truncation and grounding
can be seen as extremes of feature selection).

3. Learning pHTNs from User Generated Plans 

Our formalization of Hierarchical Task Networks is isomorphic, not just
analogous, to formal definitions of Context Free Grammars (tasks $\leftrightarrow$
non-primitives, actions $\leftrightarrow$ words, methods/schemas $\leftrightarrow$
production rules); this comes at a price, but the advantage is that grammar
induction techniques are more or less directly applicable. The technique of
choice for learning probabilistic grammars, and so the choice we adapt to
learning pHTNs, is Expectation-Maximization [6].

Despite formal equivalence, casting the problem as learning pHTNs (rather 
than pCFGs) does make a difference in what assumptions are appropriate. For 
example, we do not allow annotations on the primitives of input sequences giving
hints concerning non-primitives; for language learning it is reasonable to assume
that such annotations are available, because the non-primitives involved are agreed 
upon by multiple users (or there is no communication). In particular informa- 
tion sources such as dictionaries and informal grammars can be mined relatively 
cheaply. In the case of preference learning for plans, the non-primitives of interest 
are user-specific mental constructs (preferences), and so it is far less reasonable to 
assume that appropriate annotations could be obtained cheaply. So, unlike learn- 
ing pCFGs, our system must invent all of its own non-primitive symbols without 
any hints. 

Our learner operates in two phases. First a structure hypothesizer (SH) invents
non-primitive symbols and associated reduction schemas (tasks and methods), as
needed, in a greedy fashion, to cover all the training examples. In the second
phase, the probabilities of the reduction schemas are iteratively improved by an
Expectation-Maximization (EM) approach. The result is a local optimum in the
space of pHTN domains (instead of
$\mathcal{H}^* = \operatorname{argmax}_{\mathcal{H}} P(\mathcal{H} \mid \Phi, T)$,
the global maximum).

3.1. Structure Hypothesizer (SH) 

We develop a (greedy) structure hypothesizer (SH) in order to generate a set 
of methods that can, at least, parse all plan examples, but more than that, parse 
all the plan examples without resorting to various kinds of trivial grammars (for 
example, parsing each plan example with a disjoint set of methods). The basic idea 
is to iteratively factor out frequent common subsequences, in particular frequent 
common pairs since we work in Chomsky normal form. We describe the details 
in the following; Algorithm 1 summarizes in pseudocode.

SH learns reduction schemas in a bottom-up fashion. It starts by initializing
$\mathcal{H}$ with a separate reduction for each primitive (from distinct
non-primitives); this is a minor technical requirement of Chomsky normal form.^4
Then all plan examples are rewritten using this initial set of rules: so far not
much of import has occurred.

Next the algorithm enters its main loop: hypothesizing additional schemas
until all plan examples can be parsed to an instance of the top level task, $T$.^5
In short, SH hypothesizes a schema, rewrites the plan examples using the new
schema as much as possible and repeats until done. At that point probabilities
are initialized randomly, that is, by assigning uniformly distributed numbers to
each schema and normalizing by task (so that
$\sum_{m \in \mathcal{M}(Z)} \theta(m)$ becomes 1 for each task $Z$); the EM
phase is responsible for fitting the probabilities to the observed distribution
of plans.

^4 It is not necessary to use a distinct non-primitive for each reduction to a
primitive, but it does not really hurt either, as synonymous primitives can be
identified one level higher up in the grammar at a small cost in number of rules.

^5 The implementation in fact allows the single rule T -> Z instead of the set in
line 6, but for the sake of notation (elsewhere) we assume a strict
representation here.

Algorithm 1: SH(plan examples Φ) returns pHTN H

 1  H := {Z_a -> a | a ∈ A};                  // primitive action schemas
 2  rewrite-plans(Φ, H);
 3  while not empty(Φ) do
 4    case |φ := shortest-plan(Φ)| <= 2
 5      if |φ| = 2 then H := H + (T -> φ);
 6      else H := H ∪ {T -> α | Z -> α ∈ H};   // φ = Z for some Z
 7    case ((Z, X, d) := best-simple-recursion(Φ)) is good enough
 8      if d = left then H := H + (Z -> Z X);
 9      if d = right then H := H + (Z -> X Z);
10    case otherwise
11      (X, Y) := most-frequent-pair(Φ);
12      H := H + (Z_XY -> X Y);                // Z_XY is a new task
13    rewrite-plans(Φ, H);                     // plans rewritten to T are removed
14  end
15  initialize-probabilities(H);
16  return H
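
The random initialization step can be sketched as follows. The rule representation (a `(head, body)` tuple) and the function name are illustrative assumptions; the paper only states that uniformly distributed numbers are assigned and then normalized per task.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (head task, body symbols)


def initialize_probabilities(rules: Iterable[Rule],
                             rng: Optional[random.Random] = None) -> Dict[Rule, float]:
    """Assign a random weight to each rule, then normalize per head task."""
    rng = rng or random.Random()
    # 1 - random() lies in (0, 1], which avoids a zero total for any task.
    weights = {rule: 1.0 - rng.random() for rule in sorted(rules)}
    totals: Dict[str, float] = defaultdict(float)
    for (head, _), w in weights.items():
        totals[head] += w
    return {rule: w / totals[rule[0]] for rule, w in weights.items()}
```

A task with a single rule always receives probability 1, so only tasks with competing reductions start from a genuinely random distribution.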

In order to hypothesize a schema, SH first searches for evidence of a recursive
schema: subsequences of symbols in the form {sz, ssz, sssz} or {zs, zss, zsss}
(simple repetitions). Certainly patterns such as zababab have recursive structure,
but these are identified at a later stage of the iteration. The frequency of such
simple repetitions in the entire plan set is measured, as is their average length.
If both meet minimum thresholds, then the appropriate recursive schema is added
to $\mathcal{H}$. The thresholds themselves are functions of the average length
and total number of (rewritten, remaining) plan examples in $\Phi$.

If not (i.e., one or both thresholds are not met), then the frequency count of
every pair of symbols is computed, and the maximum pair is added as a reduction
from a distinct (i.e., new) non-primitive. In the prior example of a symbol
sequence zababab, eventually ab might win the frequency count, and be replaced
with some symbol, say s. After rewriting, the example sequence then becomes
zsss, lending evidence in future iterations, of the kind SH recognizes, to the
existence of a recursive schema (of the form $z \to zs$); if such a recursive
schema is added, then eventually the sequence gets rewritten to just z.

[Figure 2 shows the two example plans, (Buyticket, Getin, Getout) and
(Buyticket, Getin, Getout, Getin, Getout, Getin, Getout), their rewriting
through levels 1 to 4, and the constructed schemas.]

Figure 2: A trace of the Structure Hypothesizer on a variant of the Travel domain.

Example. Consider a variant of the Travel domain (Figure 1) allowing the traveler
to purchase a day pass (instead of a single-trip ticket) for the train. Two
training plans are shown in Figure 2. First SH builds the primitive action
schemas: $A_1 \to$ Buyticket, $A_2 \to$ Getin, and $A_3 \to$ Getout; the updated
plan examples are shown as level 2 in the Figure. Next, since $A_2 A_3$ is the
most frequent pair in the plans (and there is insufficiently obvious evidence of
recursion), SH constructs a rule $S_1 \to A_2 A_3$. After updating the plans with
the new rule, the plans become $A_1 S_1$ and $A_1 S_1 S_1 S_1$, depicted as level
3 in the Figure. At this point SH realizes the recursive structure (the simple
repetition $A_1 S_1 S_1 S_1$), and so adds the rule $A_1 \to A_1 S_1$. After
rewriting, all plans are parsable to the symbol $A_1$ (let $T = A_1$), so SH
is done: the final set of schemas is at the bottom left of Figure 2.
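
The greedy loop can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the "good enough" test is replaced by a hypothetical rule (any run of at least `min_run` repeated symbols next to another symbol counts as evidence of recursion), top-level rules only remove whole plans, and the symbol names `Z_<action>`, `S<n>`, and `T` are invented here. The case order follows Algorithm 1, so the sketch ends with explicit rules for `T` rather than identifying `T` with an existing symbol.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (head task, body symbols)
Plan = Tuple[str, ...]


def replace_pair(plan: Plan, x: str, y: str, new: str) -> Plan:
    """Replace non-overlapping occurrences of (x, y) by `new`, left to right."""
    out, i = [], 0
    while i < len(plan):
        if i + 1 < len(plan) and plan[i] == x and plan[i + 1] == y:
            out.append(new)
            i += 2
        else:
            out.append(plan[i])
            i += 1
    return tuple(out)


def collapse(plan: Plan, x: str, y: str, new: str) -> Plan:
    """Apply replace_pair repeatedly; used to rewrite with a recursive rule."""
    while True:
        nxt = replace_pair(plan, x, y, new)
        if nxt == plan:
            return plan
        plan = nxt


def find_simple_recursion(plans: Sequence[Plan],
                          min_run: int = 2) -> Optional[Tuple[str, str, str]]:
    """Most frequent 'z s s ...' (left) or '... s s z' (right) pattern, if any."""
    votes: Counter = Counter()
    for plan in plans:
        i = 0
        while i < len(plan):
            j = i
            while j + 1 < len(plan) and plan[j + 1] == plan[i]:
                j += 1
            if j - i + 1 >= min_run:            # plan[i..j] is a maximal run
                if i > 0:
                    votes[(plan[i - 1], plan[i], "left")] += 1
                if j + 1 < len(plan):
                    votes[(plan[j + 1], plan[i], "right")] += 1
            i = j + 1
    return votes.most_common(1)[0][0] if votes else None


def structure_hypothesizer(examples: Sequence[Plan], top: str = "T") -> Set[Rule]:
    rules: Set[Rule] = {(f"Z_{a}", (a,)) for plan in examples for a in plan}
    plans: List[Plan] = [tuple(f"Z_{a}" for a in p) for p in examples if p]
    fresh = 0
    while plans:
        shortest = min(plans, key=len)
        if len(shortest) <= 2:                  # lines 4-6 of Algorithm 1
            if len(shortest) == 2:
                rules.add((top, shortest))
            else:
                rules |= {(top, body) for head, body in rules if head == shortest[0]}
            plans = [p for p in plans if p != shortest]
            continue
        recursion = find_simple_recursion(plans)
        if recursion is not None:               # lines 7-9
            z, s, direction = recursion
            if direction == "left":
                rules.add((z, (z, s)))
                plans = [collapse(p, z, s, z) for p in plans]
            else:
                rules.add((z, (s, z)))
                plans = [collapse(p, s, z, z) for p in plans]
        else:                                   # lines 10-12
            pairs = Counter(pair for p in plans for pair in zip(p, p[1:]))
            (x, y), _ = pairs.most_common(1)[0]
            fresh += 1
            new = f"S{fresh}"
            rules.add((new, (x, y)))
            plans = [replace_pair(p, x, y, new) for p in plans]
    return rules
```

On the two day-pass plans of the example, this sketch invents one pair rule for (Getin, Getout) and one left-recursive rule for the ticket symbol, mirroring the trace described above.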

3.2. Refining Schema Probabilities: EM Phase 

We take an Expectation-Maximization (EM) approach in order to learn appropriate
parameters for the set of schemas returned by SH. EM is a gradient-ascent
method with two phases to each iteration: first the current model is used to
compute expected values for the hidden variables (E-step: induces a well-behaved
lower bound on the true gradient), and then the model is updated to maximize
the likelihood of those particular values (M-step: ascends to the maximum of the
lower bound). Doing so will normally change the expected values of the hidden
variables, so the process is repeated until convergence, and convergence does in
fact occur [?]. For our problem, standard (soft-assignment) EM would compute
an entire distribution over all possible parses (of each plan, at each iteration);
as the grammars are automatically generated, there may very well be a huge number
of such parses. So instead we focus on computing just the parse considered most
likely by the current parameters, that is, we are employing the hard-assignment
variation of EM. One beneficial side-effect is that this introduces bias in favor
of less ambiguous grammars; for in-depth analysis of the tradeoffs involved
in choosing between hard and soft assignment see [?]. In the following we
describe the details of our specialization of (hard-assignment) EM.

In the E-step, the current model $\mathcal{H}_l$ is used to compute the most
probable parse tree, $X^*(\phi)$, of each example (from the fixed start symbol
$T$): $X^*(\phi) = \operatorname{argmax}_{\text{parse } X} P(X \mid \phi, T, \mathcal{H}_l)$.
This computation can be implemented reasonably efficiently in a bottom-up fashion
since any subtree of a most probable parse is also a most probable parse (of the
subsequence it covers, given its root, etc.). The first level is particularly
simple since we associated every primitive, $a$, with a distinct non-primitive,
$Z_a$ (so its most probable, indeed only, parse is just $Z_a \to a$). The
remainder of the parsing computes:

$$
P(a_i, \ldots, a_j \mid Z, \mathcal{H}_l) =
\max_{\substack{(Z \to XY,\, \theta) \in \mathcal{H}_l \\ i \le k < j}}
\theta \cdot P(a_i, \ldots, a_k \mid X, \mathcal{H}_l) \cdot P(a_{k+1}, \ldots, a_j \mid Y, \mathcal{H}_l), \qquad (6)
$$

for all indices $i < j \in [n]$ and tasks (non-terminals) $Z$ (so the parsing
computes $O(n^2 m)$ maximizations, each in $O(nr/m)$ steps, for a worst-case
runtime of $O(n^3 r)$ on a plan of length $n$ with $m$ tasks and $r$ rules in the
pHTN). Conceivably one of the reduction schemas might exist with probability 0:
the implementation prunes such schemas rather than waste computation. By
recording the rule and midpoint ($k$) winning each maximization, the most
probable parse of $\phi$, $X^*(\phi)$, can be easily extracted (beginning at
$P(a_1, \ldots, a_n \mid T, \mathcal{H}_l)$).
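
The bottom-up computation of Equation 6 is the Viterbi variant of CKY parsing. The sketch below is one possible rendering; the dictionary layout for methods, the half-open span indexing, and the nested-tuple tree format are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Binary = Dict[str, List[Tuple[float, str, str]]]  # Z -> [(theta, X, Y)]
Unary = Dict[str, List[Tuple[float, str]]]        # Z -> [(theta, a)]


def most_probable_parse(plan: Sequence[str], binary: Binary, unary: Unary,
                        top: str) -> Tuple[float, Optional[tuple]]:
    """Return (probability, tree) of the best parse of `plan` rooted at `top`."""
    n = len(plan)
    best: Dict[Tuple[int, int, str], float] = {}  # best parse of plan[i:j] from Z
    back: Dict[Tuple[int, int, str], Optional[Tuple[int, str, str]]] = {}
    for i, a in enumerate(plan):                  # first level: Z -> a
        for z, rules in unary.items():
            for theta, b in rules:
                if b == a and theta > best.get((i, i + 1, z), 0.0):
                    best[(i, i + 1, z)] = theta
                    back[(i, i + 1, z)] = None
    for span in range(2, n + 1):                  # longer spans: Z -> X Y
        for i in range(0, n - span + 1):
            j = i + span
            for z, rules in binary.items():
                for theta, x, y in rules:
                    if theta == 0.0:
                        continue                  # prune zero-probability schemas
                    for k in range(i + 1, j):     # midpoint of the split
                        p = theta * best.get((i, k, x), 0.0) * best.get((k, j, y), 0.0)
                        if p > best.get((i, j, z), 0.0):
                            best[(i, j, z)] = p
                            back[(i, j, z)] = (k, x, y)

    def build(i: int, j: int, z: str) -> tuple:
        choice = back[(i, j, z)]
        if choice is None:
            return (z, plan[i])
        k, x, y = choice
        return (z, build(i, k, x), build(k, j, y))

    if (0, n, top) not in best:
        return 0.0, None
    return best[(0, n, top)], build(0, n, top)
```

With the Travel domain of Table 1, the train plan parses with probability 0.8, while a plan consisting only of Hitchhike has no parse and yields probability 0.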

After getting the most probable parse trees for all plan examples, the learner
moves on to the M-step. In this step, the probabilities associated with each
reduction schema are updated by maximizing the likelihood of generating those
particular parse trees; let $\mathcal{X} = \{X^*(\phi) \mid \phi \in \Phi\}$ and
let $M[\text{event}]$ count how many times the specified event happens in
$\mathcal{X}$ (for example, $M[Z]$, for some task $Z$, is the total number of
times $Z$ appears in the parses $\mathcal{X}$). Then:

$$
\begin{aligned}
\mathcal{H}_{l+1} &= \operatorname{argmax}_{\mathcal{H}'} \prod_{X^* \in \mathcal{X}} P(X^* \mid T, \mathcal{H}'), \\
&= \operatorname{argmax}_{\mathcal{H}'} \prod_{X^* \in \mathcal{X}} \; \prod_{(Z \to XY) \in X^*} P(Z \to XY \mid \mathcal{H}'), \\
&= \operatorname{argmax}_{\mathcal{H}'} \prod_{Z} \; \prod_{(Z \to XY,\, \theta_{ZXY}) \in \mathcal{H}'} \theta_{ZXY}^{\,M[Z \to XY]}, && (7)
\end{aligned}
$$

where the maximization is only over different parameterizations (not over all
pHTNs). Each task $Z$ enjoys independent parameters, and from above the
likelihood expression is a multinomial in those parameters
($\theta_{ZXY} = P(Z \to XY \mid \mathcal{H}')$), and so can be maximized simply
by setting:

$$
\theta_{ZXY} = \frac{M[Z \to XY]}{M[Z]}. \qquad (8)
$$

That is, the E-step completes the input data $\Phi$ by computing the parses of
$\Phi$ expected by $\mathcal{H}_l$; subsequently the M-step treats those parses as
ground truth, and sets the new reduction probabilities to the 'observed'
frequency of such reductions in the completed data. This improves the likelihood
of the model, and the process is repeated until convergence.
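
The M-step amounts to counting reductions in the completed data and normalizing per task. The following sketch assumes parse trees are nested tuples, `(Z, a)` for a reduction to a primitive and `(Z, left, right)` otherwise; that format and the function names are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (head task, body symbols)


def rules_in(tree: tuple) -> Iterator[Rule]:
    """Yield the reduction used at every internal vertex of a parse tree."""
    if len(tree) == 2:                  # (Z, a): reduction to a primitive
        yield tree[0], (tree[1],)
    else:                               # (Z, left, right): reduction to two tasks
        head, left, right = tree
        yield head, (left[0], right[0])
        yield from rules_in(left)
        yield from rules_in(right)


def m_step(parses: Iterable[tuple]) -> Dict[Rule, float]:
    """Set each rule probability to its relative frequency among uses of its head."""
    rule_counts = Counter(r for tree in parses for r in rules_in(tree))
    task_counts: Counter = Counter()
    for (head, _), count in rule_counts.items():
        task_counts[head] += count
    return {rule: count / task_counts[rule[0]] for rule, count in rule_counts.items()}
```

Rules that never occur in the most probable parses are absent from the result, which corresponds to the effective deletion of schemas by assigning them probability 0.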

Discussion: Although the EM phase of learning does not introduce new reduc- 
tion schemas, it does participate in structure learning in the sense that it effectively 
deletes reduction schemas by assigning zero probability to them. For this reason 
SH does not attempt to find a completely minimal grammar before running EM. 
Nonetheless it is important that SH generates small grammars, as otherwise
overfitting could become a serious problem. Among the worst choices for a
hypothetical structure learner would be to include the trivial grammar that
produces all and only the training plans; if this occurs the EM algorithm above
would happily drive the probability of all other rules to 0, as the included
trivial grammar would allow the perfect reproduction of training data.

3.3. Evaluation 

To evaluate our pHTN learning approach, we designed and carried out experiments
in both synthetic and benchmark domains. All the experiments were run on
a 2.13 GHz Windows PC with 1.98 GB of RAM. Although we focus on accuracy
(rather than CPU time), we should clarify up-front that the runtime for learning
is quite reasonable: between (almost) 0 ms and 44 ms per training plan. We take
an oracle-based experimental strategy, that is, we generate an oracle pHTN
$\mathcal{H}^*$ (to represent a possible user) and then subsequently use it to
generate behavior $\Phi$ (a set of preferred plans). Our learner then induces a
pHTN $\mathcal{H}$ from only $\Phi$; so then we can assess the effectiveness of
the learning in terms of the differences between the original and learned models.
In some settings (e.g., knowledge discovery) it is very interesting to directly
compare the syntax of learned models against ground truth, but for our purposes
such comparisons are much less interesting: we can be certain that, syntactically,
$\mathcal{H}$ will look nothing like a real user's preferences (as expressed in
pHTN form) for the trivial reason (among others) that $\mathcal{H}$ will be in
Chomsky normal form. For our purposes it is enough for $\mathcal{H}$ to generate
an approximately correct distribution on plans. So the ideal evaluation is some
measure of the distance between distributions (on plans), for example
Kullback-Leibler (KL) divergence:

$$
D_{KL}(\mathcal{P}_{\mathcal{H}^*} \,\|\, \mathcal{P}_{\mathcal{H}}) =
\sum_{\phi} \mathcal{P}_{\mathcal{H}^*}(\phi) \cdot \log \frac{\mathcal{P}_{\mathcal{H}^*}(\phi)}{\mathcal{P}_{\mathcal{H}}(\phi)}, \qquad (9)
$$

where $\mathcal{P}_{\mathcal{H}}$ and $\mathcal{P}_{\mathcal{H}^*}$ are the
distributions of plans generated by $\mathcal{H}$ and $\mathcal{H}^*$
respectively. This measure is 0 for equal distributions, and is otherwise
positive, with no upper bound.

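However, as given the summation is over the infinite set of all plans, so instead
we approximate by sampling, but this exacerbates a deeper problem: the measure
is trivially infinite if $\mathcal{P}_{\mathcal{H}}$ gives 0 probability to any
plan (that $\mathcal{P}_{\mathcal{H}^*}$ does not). So in the following we take
measurements by sampling $X$ plans from $\mathcal{H}^*$ and $\mathcal{H}$,
obtaining sample distributions $\hat{\mathcal{P}}_{\mathcal{H}^*}$ and
$\hat{\mathcal{P}}_{\mathcal{H}}$, and then we prune any plans not in
$\hat{\mathcal{P}}_{\mathcal{H}^*} \cap \hat{\mathcal{P}}_{\mathcal{H}}$,
renormalize, obtaining $\hat{\mathcal{P}}'_{\mathcal{H}^*}$ (say $P_1$) and
$\hat{\mathcal{P}}'_{\mathcal{H}}$ (say $P_2$), and finally compute:

$$
\hat{D}(\mathcal{H}^* \,\|\, \mathcal{H}) = D_{KL}(P_1 \,\|\, P_2) =
\sum_{\phi} P_1(\phi) \cdot \log \frac{P_1(\phi)}{P_2(\phi)}. \qquad (10)
$$

This is not a good approach if the intersection is small, but in our experiments
$|\hat{\mathcal{P}}_{\mathcal{H}^*} \cap \hat{\mathcal{P}}_{\mathcal{H}}| \,/\, |\hat{\mathcal{P}}_{\mathcal{H}^*} \cup \hat{\mathcal{P}}_{\mathcal{H}}|$
is close to 1. This modification also imposes an upper bound on the measure, of
about $O(\log X)$.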
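
A minimal sketch of the pruned, renormalized measure of Equation 10 follows. The function name is hypothetical, plans are assumed to be hashable (e.g., tuples of action names), and the natural logarithm is an assumption since the base is not stated.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def pruned_kl(samples_oracle: Iterable[Hashable],
              samples_learned: Iterable[Hashable]) -> float:
    """KL divergence between two sample distributions restricted to shared plans."""
    c1, c2 = Counter(samples_oracle), Counter(samples_learned)
    common = c1.keys() & c2.keys()          # prune plans not sampled from both
    if not common:
        raise ValueError("no plan was sampled from both models")
    n1 = sum(c1[p] for p in common)         # renormalization constants
    n2 = sum(c2[p] for p in common)
    return sum((c1[p] / n1) * math.log((c1[p] / n1) / (c2[p] / n2))
               for p in common)
```

Because every retained plan has a nonzero count on both sides, the result is always finite, which is the point of the pruning step.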

3.4. Experiments in Randomly Generated Domains 

In these experiments, we randomly generate the oracle pHTN $\mathcal{H}^*$, by
randomly generating a set of recursive and non-recursive schemas on $n$
non-primitives. In non-recursive domains, the randomly generated schemas form a
binary and-or tree with the goal as the root. The probabilities are also assigned
randomly. Generating recursive domains is similar with the only difference being
that 10% of the schemas generated are recursive. For measuring overall
performance we provide $10n$ training plans and take $100n$ samples for testing;
for any given $n$ we repeat the experiment 100 times and plot the mean. The
results are shown in Figure 3(a). We also discuss two additional, more
specialized, evaluations.

Rate of Learning: In order to test the learning speed, we first measured KL
divergence values with 15 non-primitives given different numbers of training
plans. The results are shown in Figure 3(a). We can see that even with a
relatively small number of training examples, our learning mechanism can still
construct pHTN schemas with divergence no more than 0.2; as expected the learning
performance further improves given many training examples. As briefly discussed
in the setup, our measure is not interesting unless the learned pHTN can
reproduce most testing plans with non-zero probability, since any 0 probability
plans are ignored in the measurement, so we do not report results given only a
very small number of training examples (the value would be artificially close to
0). Here 'very small' means too small to give at least one example of every (or
most) reductions in the oracle schema; without at least one example the structure
hypothesizer will (rightly) prevent the generation of plans with such structure.

Figure 3: Experimental results in synthetic domains. (a) KL divergence values
with different numbers of training plans. (b) Measuring conciseness in terms of
the ratio between the number of actions in the learned and original schemas.

Effectiveness of the EM Phase: To examine the effect of the EM phase, we carried
out experiments comparing the divergence (to the oracle) before and after
running the EM phase. Figures 4(a) and 4(b) plot results in the non-recursive and
recursive cases respectively. Overall the EM phase is quite effective, for
example, with 50 non-primitives in the non-recursive setting the EM phase is able
to improve the divergence from 0.818 (the divergence of the model produced by SH)
to the much smaller divergence of 0.066.

Conciseness: The conciseness of the learned model is also an important factor
measuring the quality of the approach (despite being a syntactic rather than
semantic notion), since allowing huge pHTNs will overfit (with enough available
symbols the learner could, in theory, just memorize the training data). A simple
measure of conciseness, the one we employ, is the ratio of non-primitives in the
learned model to non-primitives in the oracle ($n$); the learner is not told how
many symbols were used to generate the training data. Figure 3(b) plots results.
For small domains (around $n = 10$) the learner uses between 10 and 20% more
non-primitives, a fairly positive result. However, for larger domains this result
degrades to 60% more non-primitives, a somewhat negative result. Even so, the
divergence measure improves on the hidden test set, so while there is some
evidence of possible overfitting the result is not alarming. Future work in
structure learning should nonetheless examine this issue (conciseness and
overfitting) in greater depth.

Figure 4: Experimental results in synthetic domains. (a) KL divergence between
plans generated by original and learned schemas in non-recursive domains. (b) KL
divergence between plans generated by original and learned schemas in recursive
domains.

Note: Divergence in the recursive case is consistently larger than in the
non-recursive case across all experiments: this is expected. In the recursive
case the plan space is actually infinite; in the non-recursive case there are
only finitely many plans that can be generated. So, for example, in the
non-recursive case, it is actually possible for a finite sample set to perfectly
represent the true distribution (simply memorizing the training data will
produce 0 divergence eventually).

3.5. Benchmark Domains 

In addition to the experiments with synthetic domains, we also picked two of 
the well known benchmark planning domains and simulated possible users (in the 
form of hand-constructed pHTNs). 

Logistics Planning: The domain we used in the first experiment is a variant of
the Logistics Planning domain, inside which both planes and trucks are available
to move packages, and every location is reachable from every other. There are 4
primitives in the domain: load, fly, drive and unload; we use 11 tasks to express,
in the form of an oracle pHTN $\mathcal{H}^*$ (in Chomsky normal form, hence 11
tasks), our preferences concerning logistics planning. We presented 100 training
plans to the learning system; these demonstrate our preference for moving
packages by planes rather than trucks and for using overall fewer vehicles.

Table 2: Learned schemas in Logistics

Primitives: load, fly, drive, unload;
Tasks: movePackage, S0, S1, S2, S3, S4, S5;

movePackage -> movePackage movePackage, 0.17
movePackage -> S0 S5, 0.25      S5 -> S3 S2, 1.0
movePackage -> S0 S4, 0.58      S4 -> S1 S2, 1.0
S0 -> load, 1.0                 S1 -> fly, 1.0
S2 -> unload, 1.0               S3 -> drive, 1.0

The divergence of the learned model is 0.04 (against a hidden test set, on a
single run). While we are generally unconcerned with the syntax of the learned
model, it is interesting to consider in this case: Table 2 shows the learned model.
With some effort one can verify that the learned schemas do capture our
preferences: the second and third schemas for 'movePackage' encode delivering a
package by truck and by plane respectively (and delivering by plane has
significantly higher probability), and the first schema permits repeatedly moving
packages, but with relatively low probability. That is, it is possible to
recursively expand 'movePackage' so that one package ends up transferring vehicles,
but the plan that uses only one instance of the first schema per package is
significantly more probable (by a factor of $0.17^{-k}$ for k transfers between
vehicles).
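
To make the reading of Table 2 concrete, the following is a minimal sketch (not
part of the original system) that computes the total probability of a plan under
the learned Logistics schemas with the standard inside algorithm for grammars in
Chomsky normal form. The rule tables are transcribed by hand from Table 2, and the
names BINARY, UNARY and plan_probability are illustrative assumptions. Note that
it sums over all parses, whereas the text elsewhere uses the likelihood of the
most probable parse; the two coincide for the unambiguous plans used here.

```python
from collections import defaultdict

# Learned Logistics schemas, transcribed from Table 2 (an assumption of this sketch).
BINARY = {  # task -> list of ((left subtask, right subtask), probability)
    "movePackage": [(("movePackage", "movePackage"), 0.17),
                    (("S0", "S5"), 0.25),
                    (("S0", "S4"), 0.58)],
    "S5": [(("S3", "S2"), 1.0)],
    "S4": [(("S1", "S2"), 1.0)],
}
UNARY = {  # task -> list of (primitive action, probability)
    "S0": [("load", 1.0)], "S1": [("fly", 1.0)],
    "S2": [("unload", 1.0)], "S3": [("drive", 1.0)],
}

def plan_probability(plan, root="movePackage"):
    """Total probability that `root` reduces to `plan` (inside algorithm)."""
    n = len(plan)
    inside = defaultdict(float)  # (start, end, task) -> probability
    # Spans of length 1: tasks that reduce directly to a primitive action.
    for i, action in enumerate(plan):
        for task, rules in UNARY.items():
            for primitive, prob in rules:
                if primitive == action:
                    inside[i, i + 1, task] += prob
    # Longer spans: combine two adjacent, strictly shorter spans.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for task, rules in BINARY.items():
                for (left, right), prob in rules:
                    for mid in range(start + 1, end):
                        inside[start, end, task] += (
                            prob * inside[start, mid, left] * inside[mid, end, right])
    return inside[0, n, root]
```

Under these schemas a single plane delivery (load, fly, unload) has probability
0.58, a single truck delivery has probability 0.25, and two consecutive plane
deliveries have probability 0.17 × 0.58 × 0.58, which is about 0.057.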

Gold Miner: The second domain we used is Gold Miner, introduced in the learning
track of the 2008 International Planning Competition. The setup is a (futuristic)
robot tasked with retrieving gold (blocked by rocks) within a mine; the robot
can employ bombs and/or a laser cannon. The laser cannon can destroy both hard
and soft rocks, while bombs only destroy soft rocks; however, the laser cannon
will also destroy any gold immediately behind its target. The desired strategy
for this domain, which we encode in pHTN form using 12 tasks (H*), is: 1) get
the laser cannon, 2) shoot the rock until reaching the cell next to the gold, 3) get
a bomb, 4) use the bomb to get gold.

Table 3: Learned schemas in Gold Miner

Primitives: move, getLaserGun, shoot, getBomb, getGold;
Tasks: goal, S0, S1, S2, S3, S4, S5, S6;

goal → S0 goal, 0.78             goal → S1 S6, 0.22
S0 → move, 1.0                   S5 → S2 S0, 1.0
S1 → getLaserGun, 0.22           S1 → S1 S5, 0.78
S2 → shoot, 1.0                  S6 → S3 S4, 1.0
S3 → getBomb, 0.29               S3 → S3 S0, 0.71
S4 → getGold, 1.0

We gave the system 100 training plans of various lengths (generated by H*);
the learner achieved a divergence of 0.52. This is a significantly larger divergence
than in the case of Logistics above, which can be explained by the significantly
greater use of recursion (one can think of Logistics as less recursive and Gold
Miner as more recursive, and as noted in the random experiments, recursive
domains are much more challenging). Nonetheless the learner did succeed in
qualitatively capturing our preferences, which can be seen by inspection of the
learned model in Table 3. Specifically, the learned model only permits plans in
the order given above: get the laser cannon, shoot, get and then use the bomb,
and finally get the gold.
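
The claim about plan order can be checked by sampling. The sketch below is an
illustration only: it transcribes the schemas of Table 3 by hand into a
dictionary (SCHEMAS is an assumed name) and expands the top-level task top-down,
picking each reduction with its listed probability. It is not the learner itself,
only a generator for the distribution that the learned model describes.

```python
import random

# Learned Gold Miner schemas, transcribed from Table 3 (an assumption of this sketch):
# task -> list of (expansion, probability).
SCHEMAS = {
    "goal": [(("S0", "goal"), 0.78), (("S1", "S6"), 0.22)],
    "S0": [(("move",), 1.0)],
    "S1": [(("getLaserGun",), 0.22), (("S1", "S5"), 0.78)],
    "S2": [(("shoot",), 1.0)],
    "S3": [(("getBomb",), 0.29), (("S3", "S0"), 0.71)],
    "S4": [(("getGold",), 1.0)],
    "S5": [(("S2", "S0"), 1.0)],
    "S6": [(("S3", "S4"), 1.0)],
}

def sample_plan(task="goal", rng=random):
    """Expand `task` top-down, choosing each reduction by its probability."""
    if task not in SCHEMAS:  # a primitive action
        return [task]
    expansions, probs = zip(*SCHEMAS[task])
    chosen = rng.choices(expansions, weights=probs, k=1)[0]
    plan = []
    for symbol in chosen:
        plan.extend(sample_plan(symbol, rng))
    return plan
```

Every sampled plan has the shape move* getLaserGun (shoot move)* getBomb move*
getGold, which matches the order described above.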

4. Preferences Constrained by Feasibility 

In general, users will not be so all-powerful that behavior and desire coincide.
Instead a user must settle for one (presumably the most desirable) of the
feasible possibilities. Supposing those possibilities remain constant, there is
little point in distinguishing desire and behavior; indeed, the philosophy of
behaviorism defines preference by considering such controlled experiments.
Supposing instead that feasible possibilities vary over time, the distinction
becomes very important. One example we have already considered is that of a poor
grad student: in the rare situation that such a student's travel is funded, it
would be desirable to realize the preference for planes over cars. In addition to
that example, consider the requirement to go to work on weekdays (so the
constraint does not hold on weekends). Clearly the weekend activities are the
preferred activities. However, the learning approach developed so far would be
biased, by a factor of 5/2 (five weekdays to two weekend days), in favor of the
weekday activities. In the following we consider how to account for this effect:
the effect of feasibility constraints upon learning preferences.

Recall that we assume that we can directly observe a user's behavior, for example
by building upon the work in plan recognition. In this section we additionally
assume that we have access to the set of feasible alternatives to the observed
behavior, for example by assuming access to the planning problem the user faced
and building upon the work in automated planning [12].^6 For our purposes it is
not a large sacrifice to exclude any number of feasible alternatives that have never
been chosen by the user (in some other situation); in particular we are not
concerned about the potentially enormous number of feasible alternatives (because
we can restrict our attention to a subset on the order of the number of observed
plans).^7 So, in this section, the input to the learning problem becomes:

Input. The $i$-th observation, $(\phi_i, F_i) \in \Phi$, consists of a set of feasible
possibilities, $F_i$, along with the chosen solution: $\phi_i \in F_i$.

In the rest of the section we consider how to exploit this additional training 
information (and how to appropriately define the new learning task). The main 
idea is to rescale the input (i.e., attach weights to the observed plans $\phi_i$) so that rare
situations are not penalized with respect to common situations. We approach this 
from the perspective that preferences should be transitively closed, and as a side- 
effect we might end up with disjoint sets of incomparable plans. Subsequently we 
apply the base learner to each, rescaled, component of comparable plans to obtain 
a set of pHTNs capturing user preferences. Note that the result of the system can 
now be 'unknown' in response to 'is A preferred to B?', in the case that A and 
B are not simultaneously parsable by any of the learned pHTNs. This additional 
capability somewhat complicates evaluation (as the base system can only answer 
'yes' or 'no' to such queries). 



^6 Since we already assume plan recognition, it is not a significant stretch to assume
knowledge of the planning problem itself. Indeed, planning problems are often recast as
a broken plan on two dummy actions (initial and terminal), and solutions as insertions
that fix the problem-as-plan. In particular recognizing plans subsumes, in general,
recognizing problems.

^7 Work in diverse planning [13, 14] is, then, quite relevant (to picking a subset of
feasible alternatives of manageable size).
4.1. Analysis

Previously we assumed the training data (observed plans) $\Phi$ was sampled
(i.i.d.) directly from the user's true preference distribution (say $\mathcal{U}$):

$$P(\Phi \mid \mathcal{U}) = \prod_{\phi \in \Phi} P(\phi \mid \mathcal{U}).$$

But now we assume that varying feasibility constraints intervene. For the sake
of notation, imagine that such variation is in the form of a distribution, say
$\mathcal{F}$, over planning problems (but all that is actually required is that the
variation is independent of preferences, as assumed below). Note that a planning
problem is logically equivalent to its solution set. Then we can write
$P(F \mid \mathcal{F})$ to denote the prior probability of any particular set of
solutions $F$. Since the user chooses among such solutions, we have that chosen
plans are sampled from the posterior, over solutions, of the preference distribution:

$$P(\Phi \mid \mathcal{U}, \mathcal{F}) = \prod_{(\phi, F) \in \Phi} P(\phi \mid \mathcal{U}, F) \cdot P(F \mid \mathcal{F}).$$

We assume that preferences and feasibility constraints are mutually independent:
what is possible does not depend upon desire, and desire does not depend upon
what is possible. One can certainly imagine either dependence (respectively,
Murphy's Law (or its complement) and the fox in Aesop's fable of Sour Grapes
(or envy)), but it seems to us more reasonable to assume independence. Then
we can rewrite the posterior of the preference distribution:

$$P(\Phi \mid \mathcal{U}, \mathcal{F}) = \prod_{(\phi, F) \in \Phi} \frac{P(\phi \mid \mathcal{U})}{\sum_{\phi' \in F} P(\phi' \mid \mathcal{U})} \cdot P(F \mid \mathcal{F}) \quad \text{(by assumption)}.$$

Assuming independence is important, because it makes the preference learning
problem attackable. In particular, the posteriors preserve relative preferences:
for all $\phi, \phi' \in F$, the odds of selecting $\phi$ over $\phi'$ are:

$$O(\phi, \phi') := \frac{P(\phi \mid \mathcal{U}, F)}{P(\phi' \mid \mathcal{U}, F)} = \frac{P(\phi \mid \mathcal{U}) \,/\, \sum_{\phi'' \in F} P(\phi'' \mid \mathcal{U})}{P(\phi' \mid \mathcal{U}) \,/\, \sum_{\phi'' \in F} P(\phi'' \mid \mathcal{U})} = \frac{P(\phi \mid \mathcal{U})}{P(\phi' \mid \mathcal{U})}.$$
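Therefore we can, given sufficiently many of the posteriors, reconstruct the prior
by transitive closure; consider $\phi, \phi', \phi''$ with $\phi, \phi' \in F$ and
$\phi', \phi'' \in F'$:

$$O(\phi, \phi'') = O(\phi, \phi') \cdot O(\phi', \phi'') = \frac{P(\phi \mid \mathcal{U}, F)}{P(\phi' \mid \mathcal{U}, F)} \cdot \frac{P(\phi' \mid \mathcal{U}, F')}{P(\phi'' \mid \mathcal{U}, F')}.$$

So then the prior can be had by normalization:

$$P(\phi \mid \mathcal{U}) = \frac{1}{\sum_{\phi'} O(\phi', \phi)}.$$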
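
The argument can be checked numerically. The sketch below is illustrative only:
the prior PRIOR and its numbers are invented, and the helper names are
assumptions. It restricts a prior to two overlapping feasible sets, confirms that
the odds within each set equal the prior odds, chains the odds through the shared
plan, and recovers the prior by normalization.

```python
from fractions import Fraction

# Hypothetical prior P(plan | U); the numbers are invented for illustration.
PRIOR = {"plane": Fraction(6, 10), "train": Fraction(3, 10), "bike": Fraction(1, 10)}

def posterior(prior, feasible):
    """P(plan | U, F): the prior restricted to the feasible set and renormalized."""
    total = sum(prior[p] for p in feasible)
    return {p: prior[p] / total for p in feasible}

def odds(dist, a, b):
    """Odds of selecting plan a over plan b under the distribution dist."""
    return dist[a] / dist[b]

in_f = posterior(PRIOR, {"plane", "train"})    # situation F
in_f2 = posterior(PRIOR, {"train", "bike"})    # situation F'

# Transitive closure through the shared plan "train".
chained = odds(in_f, "plane", "train") * odds(in_f2, "train", "bike")

# Normalization: express every plan relative to "bike", then rescale to sum to 1.
relative = {"bike": Fraction(1),
            "train": odds(in_f2, "train", "bike"),
            "plane": chained}
total = sum(relative.values())
recovered = {plan: weight / total for plan, weight in relative.items()}
```

Exact fractions are used so that the equalities hold without rounding error; with
sampled data the odds would only be estimates, which is the subject of what follows.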

Of course none of the above distributions are accessible; the learning problem is
only given $\Phi$. Let $M_F[\phi] = |\{i \mid (\phi, F) = (\phi_i, F_i) \in \Phi\}|$,
$M_F = \sum_{\phi} M_F[\phi]$, and $M = \sum_F M_F = |\Phi|$. Then $\Phi$ defines a
sampling distribution:

$$\hat{P}(\phi \mid \mathcal{U}, F) := \frac{M_F[\phi]}{M_F} \approx P(\phi \mid \mathcal{U}, F) \quad \text{(in the limit)},$$

in particular:

$$\hat{O}_F(\phi, \phi') := \frac{M_F[\phi]}{M_F[\phi']} \approx \frac{P(\phi \mid \mathcal{U})}{P(\phi' \mid \mathcal{U})} \quad \text{(in the limit)} \tag{11}$$

for any $F$. For anything less than an enormous amount of data, however, one expects
$\hat{O}_F$ and $\hat{O}_{F'}$ to differ considerably for $F \neq F'$, therein lying one
of the subtle aspects of the following rescaling algorithm. The intuition is, however,
simple enough: pick some base plan $\phi$ and set its weight to an appropriately large
value $w$, and then set every other weight, for example that of $\phi'$, to
$w \cdot \hat{O}(\phi', \phi)$ (where $\hat{O}$ is some kind of aggregation of the
differing estimates $\hat{O}_F$); finally give the weighted set of observed plans to
the base learner. From the preceding analysis, in the limit of infinite data, this
setup will learn the (closest approximation, within the base learner's hypothesis
space, to the) prior distribution on preferences.
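
The counts and the per-situation odds estimate are simple to compute from the
records. The following is a minimal sketch under the assumption that each record
is a (chosen plan, feasible set) pair; the function names are illustrative and
not taken from the original system.

```python
from collections import Counter, defaultdict

def count_choices(records):
    """M_F[phi]: how often plan phi was chosen when the feasible set was F."""
    counts = defaultdict(Counter)
    for chosen, feasible in records:
        counts[frozenset(feasible)][chosen] += 1
    return counts

def estimated_odds(counts, feasible, a, b):
    """Sample odds M_F[a] / M_F[b], or None when b was never chosen in F."""
    m = counts[frozenset(feasible)]
    return m[a] / m[b] if m[b] else None
```

Returning None when the denominator count is zero is a design choice of this
sketch; the rescaling algorithm instead assigns a small weight to plans that were
feasible but never chosen.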

To address the issue that different situations will give different estimates (due
to sampling error) for the relative preference of one plan to another ($\hat{O}_F$ and
$\hat{O}_{F'}$ will differ) we employ a merging process on such overlapping situations.
Consider two weighted sets of plans, $c$ and $d$, and interpret^8 the weight of a plan
as the number of times it 'occurs', e.g., $w_c(\phi) = M_c[\phi]$. In the simple case
that there is only a single plan in the intersection, $\{a\} = c \cap d$, there is only
one way to take a transitive closure; for all $\phi \in c$ and $\phi' \in d \setminus c$:

$$\hat{O}_{c,d}(\phi, \phi') = \hat{O}_c(\phi, a) \cdot \hat{O}_d(a, \phi') = \frac{M_c[\phi]}{M_c[a]} \cdot \frac{M_d[a]}{M_d[\phi']} = \frac{M_c[\phi]}{s \cdot M_d[\phi']}, \quad \text{with } s = \frac{M_c[a]}{M_d[a]},$$

so in particular we can merge $d$ into $c$ by first rescaling $d$:

$$M_{c \cup d}[\phi] := \begin{cases} M_c[\phi] & \text{if } \phi \in c \\ s \cdot M_d[\phi] & \text{otherwise.} \end{cases}$$

In general let $s_a = M_c[a] / M_d[a]$ for any $a \in c \cap d$ be the scale factor of
$c$ and $d$ w.r.t. $a$. Then in the case that there are multiple plans in the
intersection we are faced with multiple ways of performing a transitive closure, i.e.,
a set of scale factors. These will normally be different from one another, but, in the
limit of data, assuming preferences and feasibility constraints are actually
independent of one another, every scale factor between two clusters will be equal.
So, then, we take the average:

$$M_{c \cup d}[\phi] := \begin{cases} M_c[\phi] & \text{if } \phi \in c \\ s \cdot M_d[\phi] & \text{otherwise,} \end{cases} \qquad \text{with } s = \frac{1}{|c \cap d|} \sum_{a \in c \cap d} s_a. \tag{13}$$

In short, if all the assumptions are met, and enough data is given, the described
process will reproduce the correct prior distribution on preferences. Algorithm 2
provides the remaining details in pseudocode, and in the following we discuss
these details and the result of the rescaling process operating in the Travel domain.
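
As a small worked instance of the averaged merge (the plan names and counts here
are invented purely for illustration), suppose $c$ has weights A: 4, B: 2, C: 1
and $d$ has weights A: 2, B: 2, D: 3, so that $c \cap d = \{A, B\}$:

$$
\begin{aligned}
s_A &= \frac{M_c[A]}{M_d[A]} = \frac{4}{2} = 2, \qquad s_B = \frac{M_c[B]}{M_d[B]} = \frac{2}{2} = 1, \\
s &= \frac{s_A + s_B}{2} = 1.5, \\
M_{c \cup d}[D] &= s \cdot M_d[D] = 1.5 \cdot 3 = 4.5 .
\end{aligned}
$$

The plans already in $c$ keep their weights (A: 4, B: 2, C: 1); only D, which is
new to $c$, is rescaled, and its weight is no longer an integer.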



^8 The scaling calculations could produce non-integer weights.
4.2. Rescaling

Output. The result of rescaling is a set of clusters of weighted plans,
$C = \{c_1, c_2, \ldots, c_n\}$. Each cluster, $c \in C$, consists of a set of plans
with associated weights; we write $p \in c$ for membership and $w_c(p)$ for the
associated weight.

Clustering. First we collapse all of the input records from the same or similar
situations into single weighted clusters, with one count going towards each instance
of an observed plan participating in the collapse. For example, suppose we observe
3 instances of Gobyplane chosen in preference to Gobytrain and 1 instance
of the reverse in similar or identical situations. Then we will end up with a cluster
with weights 3 and 1 for Gobyplane and Gobytrain respectively. In other words
$w_c(p)$ is the number of times $p$ was chosen by the user in the set of situations
collapsing to $c$ (or $\epsilon$ if $p$ was never chosen). This happens in lines 2-20 of
Algorithm 2, which also defines 'similar' (as set inclusion). Future work should
consider more sophisticated clustering methods.

Transitive Closure. Next we make indirect inferences between clusters; this
happens by iteratively merging clusters with non-empty intersections. Consider two
clusters, $c$ and $d$, in the Travel domain. Say $d$ contains Gobyplane and Gobytrain
with counts 3 and 1 respectively, and $c$ contains Gobytrain and Gobybike with
counts 5 and 1 respectively. From this we infer that Gobyplane would be executed
15 times more frequently than Gobybike in a situation where all 3 plans (Gobyplane,
Gobytrain, and Gobybike) are possible, since it is executed 3 times more
frequently than Gobytrain which is in turn executed 5 times more frequently than
Gobybike. We represent this inference by scaling one of the clusters so that the
shared plan has the same weight, and then take the union. In the example, supposing
we merge $d$ into $c$, we scale $d$ so that $c \cap d$ = {Gobytrain} has the same
weight in both $c$ and $d$, i.e. we scale by
$s = w_c(\text{Gobytrain}) / w_d(\text{Gobytrain}) = 5$. For pairs of clusters
with more than one shared plan we scale $d \setminus c$ by the average of
$w_c(\cdot) / w_d(\cdot)$ for each plan in the intersection, but we leave the weights
of $c \cap d$ as in $c$ (one could consider several alternative strategies for plans
in the intersection). Computing the scaling factor happens in lines 22-26 and the
entire merging process happens in lines 21-31 of Algorithm 2.

4.3. Learning 

We learn a set of pHTNs for $C$ by applying the base learner (with the obvious
generalization to weighted input) to each $c \in C$:

$$H = \{\mathcal{H}_c = \mathrm{EM}(\mathrm{SH}(c)) \mid c \in C\}. \tag{14}$$



Algorithm 2: Rescaling

    Input: Training records Φ.
    Output: Clusters C.
 1  initialize C to empty
 2  forall (φ, F) ∈ Φ do
 3      if ∃c ∈ C such that F ⊆ c or F ⊇ c then
 4          forall p ∈ F \ c do
 5              add p to c with w_c(p) := ε
 6          end
 7          if w_c(φ) ≥ 1 then
 8              increment w_c(φ)
 9          else
10              w_c(φ) := 1
11          end
12      else
13          initialize c to empty
14          add c to C
15          forall p ∈ F do
16              add p to c with w_c(p) := ε
17          end
18          w_c(φ) := 1
19      end
20  end
21  while ∃c, d ∈ C such that c ∩ d ≠ ∅ do
22      sum_ratios := 0
23      forall p ∈ c ∩ d do
24          sum_ratios += w_c(p) / w_d(p)
25      end
26      scale := sum_ratios / |c ∩ d|
27      forall p ∈ d \ c do
28          add p to c with w_c(p) := w_d(p) · scale
29      end
30      remove d from C
31  end
32  return C
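
For readers who prefer running code, the following is one possible Python
rendering of Algorithm 2, offered as a sketch rather than as the original
implementation. It assumes that records are (chosen plan, feasible set) pairs with
the chosen plan in its feasible set, represents each cluster as a dict from plan
to weight, and uses an arbitrary small constant for the weight ε of plans that were
never chosen; the function name and that constant are assumptions.

```python
def rescale(records, epsilon=1e-6):
    """Sketch of Algorithm 2: cluster records, then merge overlapping clusters.

    records: iterable of (chosen_plan, feasible_plans) pairs, with the
    chosen plan assumed to be a member of its feasible set.
    Returns a list of clusters, each a dict mapping plan -> weight.
    """
    clusters = []
    # Clustering (lines 2-20): collapse identical or nested situations.
    for chosen, feasible in records:
        feasible = set(feasible)
        home = next((c for c in clusters
                     if feasible.issubset(c) or feasible.issuperset(c)), None)
        if home is None:
            home = {}
            clusters.append(home)
        for plan in feasible:
            home.setdefault(plan, epsilon)  # never-chosen plans get epsilon
        home[chosen] = home[chosen] + 1 if home[chosen] >= 1 else 1

    # Transitive closure (lines 21-31): merge clusters that share a plan.
    i = 0
    while i < len(clusters):
        c = clusters[i]
        j = next((k for k in range(i + 1, len(clusters))
                  if c.keys() & clusters[k].keys()), None)
        if j is None:
            i += 1  # c overlaps no later cluster; move on
            continue
        d = clusters.pop(j)
        shared = c.keys() & d.keys()
        scale = sum(c[p] / d[p] for p in shared) / len(shared)
        for plan, weight in d.items():
            if plan not in c:
                c[plan] = weight * scale
    return clusters
```

On the Travel example (Gobyplane chosen 3 times and Gobytrain once in one
situation, Gobytrain chosen 5 times and Gobybike once in another) this yields a
single cluster in which Gobyplane outweighs Gobybike by a factor of 15. The
absolute weights depend on which cluster is merged into which; only the ratios
are meaningful.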
Figure 5: Experimental results for random H*. "EA" is learning with rescaling and "OA" is
learning without rescaling. (a) Learning rate; (b) size dependence: "R" for recursive H* and
"NR" for non-recursive H*.

While the input clusters will be disjoint, the base learner may very well generalize
its input such that various pairs of plans become comparable in multiple pHTNs
within $H$. Any disagreement is resolved by voting; recall that, given a pHTN
$\mathcal{H}$ and a plan $p$, we can efficiently compute the most probable parse of $p$
by $\mathcal{H}$ and its (a priori) likelihood, say $\ell_{\mathcal{H}}(p)$. Given two
plans $p$ and $q$ we let $\mathcal{H}$ order $p$ and $q$ by $\ell_{\mathcal{H}}$, i.e.,
$p \prec_{\mathcal{H}} q \iff \ell_{\mathcal{H}}(p) < \ell_{\mathcal{H}}(q)$; if either
is not parsable (or tied) then they are incomparable by $\mathcal{H}$. Given a set of
pHTNs $H = \{\mathcal{H}_1, \mathcal{H}_2, \ldots\}$, we take a simple majority vote to
decide $p \prec_H q$ (ties are incomparable):

$$p \prec_H q \iff |\{\mathcal{H} \in H \mid q \prec_{\mathcal{H}} p\}| < |\{\mathcal{H} \in H \mid p \prec_{\mathcal{H}} q\}|. \tag{15}$$

So, each pHTN votes, based on likelihood, for $p \prec q$ (meaning p is preferred
to q), or $q \prec p$ (q is preferred to p), or abstains (the preference is unknown).
Summarizing, the input $\Phi$ is 1) clustered and 2) transitively closed, producing
clusters $C$, 3) each of which is given to the base learner, resulting in a set of
pHTNs $H$ modeling the user's preferences via the relation $\prec_H$.
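
The voting step can be sketched as follows. This is an illustration under stated
assumptions: each learned pHTN is stood in for by a function that returns the
likelihood of a plan, or None when the plan cannot be parsed, and the plan with
the higher likelihood is taken to be the one that pHTN prefers. The function
name vote is hypothetical.

```python
def vote(likelihoods, p, q):
    """Majority vote among pHTNs on which of the plans p and q is preferred.

    likelihoods: one function per learned pHTN, mapping a plan to its
    likelihood, or to None when the plan is not parsable by that pHTN.
    Returns the preferred plan, or None when the vote is tied or every
    pHTN abstains (the 'unknown' answer).
    """
    for_p = for_q = 0
    for likelihood in likelihoods:
        lp, lq = likelihood(p), likelihood(q)
        if lp is None or lq is None or lp == lq:
            continue  # this pHTN abstains
        if lp > lq:
            for_p += 1
        else:
            for_q += 1
    if for_p == for_q:
        return None
    return p if for_p > for_q else q
```

Returning None for ties and for plans that no pHTN can compare is what allows the
extended system to answer 'unknown', unlike the base learner.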

4.4. Evaluation 

In this part we are primarily interested in evaluating the rescaling extension of
the learning technique, i.e., the ability to learn preferences despite feasibility
constraints. We design a simple experiment to demonstrate that learning purely from
observations is easily confounded by constraints placed in the way of user
preferences, and that our rescaling technique is able to recover preference knowledge
despite obfuscation.

4.4.1. Setup 

Performance. We again take an oracle-based experimental strategy, that is,
we imagine a user with a particular ideal pHTN, H*, representing that user's
preferences, and then test the efficacy of the learner at recovering knowledge of
preferences based on observations of the imaginary user. More specifically we test
the learner's performance in the following game. After training, the learner produces
$H_r$; to evaluate the effectiveness of $H_r$ we pick random plan pairs and ask both
H* and $H_r$ to pick the preferred plan. There are three possibilities: $H_r$ agrees
with H* (+1 point), $H_r$ disagrees with H* (-1 point), and $H_r$ declines to choose
(0 points).^9

The distribution on testing plans is not uniform and will be described below.
The number of plan pairs used for testing is scaled by the size of H*; 100t pairs
are generated, where t is the number of non-primitives. The final performance for
one instance of the game is the average number of points earned per testing pair.
Pure guessing, then, would get (in the long term) 0 performance.
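
The scoring rule of this game is short enough to state in code. The sketch below
assumes that the oracle and the learner are each given as a function from a pair
of plans to the preferred plan, with the learner returning None when it declines
to choose; the names are illustrative.

```python
def performance(oracle_choice, learner_choice, plan_pairs):
    """Average points per pair: +1 agree, -1 disagree, 0 if the learner declines."""
    total = 0
    for p, q in plan_pairs:
        answer = learner_choice(p, q)
        if answer is None:
            continue  # declined: 0 points
        total += 1 if answer == oracle_choice(p, q) else -1
    return total / len(plan_pairs)
```

A learner that always agrees with the oracle scores 1, one that always disagrees
scores -1, and one that always declines scores 0.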

User. As in the prior evaluation we evaluate on 1) randomly generated pHTNs 
modeling possible users, and on 2) hand-crafted pHTNs modeling our preferences 
in Logistics and Gold Miner. 

Training Data. For both randomly generated and hand-crafted users we use the 
same framework for generating training data. We generate random problems by 
generating random solution sets in a particular fashion, that is, we model feasibility 
constraints using a particular random distribution on solution sets. We describe 
this process in detail below, but note that the details are unimportant (we did not try 
any alternatives to the choices given); what matters is whether or not the process 
is a reasonable model of the effect of feasibility constraints (insofar as they affect 
the learning problem). 

We begin by constructing a list of plans, $\mathcal{P}$, from 100t samples of H*,
removing duplicates (so $|\mathcal{P}| \le 100t$). Due to duplicate removal, less
preferred plans occur later than more preferred plans (on average). We reverse that
order, and associate $\mathcal{P}$ with (a discrete approximation to) a power-law
distribution. Both training and test plans are drawn from this distribution. Then,
for each training record $(\phi_j, F_j)$, we take a random number^10 of samples from
$\mathcal{P}$ as $F_j$. We sample the observed plan, $\phi_j$, from $F_j$ by
$\ell_{\mathcal{H}^*}$, that is, the probability of a particular choice $p \in F_j$ is
$\ell_{\mathcal{H}^*}(p) / \sum_{q \in F_j} \ell_{\mathcal{H}^*}(q)$.

Note, then, that the random solution sets model the "worst case" of feasibility
constraints, in the sense that it is the least preferred plans that are most often
feasible; much of the time the hypothetical user will be forced to pick the least
evil rather than the greatest good.

^9 This gives rescaling a potentially significant advantage, as learning alone always
chooses. We also tested scoring "no choice" at -1 point; the results did not
(significantly) differ.

Baseline. The baseline for our experiments will be the original approach: the base
learner without rescaling. That is, we take a single cluster, where the weight of
each plan is the number of times it is observed, $w(\phi) = |\{i \mid \phi = \phi_i\}|$,
and apply the base learner, obtaining a single pHTN, $H_b = \{\mathcal{H}\}$, and score
it in the same manner as the extended approach.

4.4.2. Results: Random H*



Rate of Learning. Figure 5(a) presents the results of a learning-rate experiment
against randomly selected H*. For these experiments the number of non-primitives
is fixed at 5 while the amount of training data is varied; we plot the
average performance, over 100 samples of H*, at each training set size.

We can see that with a large number of training records, rescaling before learning
is able to capture nearly full user preferences, whereas learning alone performs
slightly worse than random chance. This is expected since without rescaling the
learner is attempting to reproduce its input distribution, which was the
distribution on observed plans, and "feasibility" is inversely related to preference
by construction. That is, given the question "Is A preferred to B?" the
learning-alone approach instead answers the question "Is A executed more often
than B?".

Size Dependence. We also tested the performance of the two approaches under
varying number of non-primitives (using 50t training records); the results are
shown in Figure 5(b). For technical reasons, the base learner is much more
effective at recovering user preferences when these take the form of recursive
schemas, so there is less room for improvement. Nonetheless the rescaling approach
improves upon learning alone in both experiments.



^10 The number of samples taken is selected from $|\mathcal{P}| \cdot |\mathcal{N}(\cdot)| / 2$,
subject to minimum 2 and maximum $|\mathcal{P}|$, where $\mathcal{N}(\cdot)$ is the normal
distribution. Larger solution sets model 'easier' planning problems.
4.4.3. Results: Hand-crafted H*

We re-use the same pHTNs encoding our preferences in Logistics and Gold 
Miner from the first set of evaluations. As mentioned we use the same setup as 
in the random experiments, so it continues to be the case that the distribution on 
random 'solutions' is biased against the encoded preferences. Moreover, due to 
the level of abstraction used (truncating to action names), as well as the nature of 
the pHTNs and domains in question, the randomly generated sets of alternatives, 
Fi, are in fact sets of solutions to some problem expressed in the normal fashion 
(i.e., as an initial state and goal). 

Logistics. After training with 550 training records (50t, for 11 non-primitives)
the baseline system scored only 0.342 (0 is the performance of random guessing) 
whereas rescaling before learning performed significantly better with a score of 
0.847 (0.153 away from perfect performance). 

Gold Miner: After training with 600 examples (50t, for 12 non-primitives) learning
alone scored a respectable 0.605; still, rescaling before learning performed
better with a score of 0.706. Note that the greater recursion in Gold Miner, as
compared to Logistics, is both hurting and helping. On the one hand the full
approach scores worse (0.706 vs. 0.847); on the other hand, the baseline's
performance is hugely improved (0.605 vs. 0.342). As discussed previously, the
presence of recursion in the preference model makes the learning problem much harder
(since the space of acceptable plans is then actually infinite), which continues to
be a reasonable explanation of the first effect (degrading performance).

The latter effect is more subtle. The experimental setup, roughly speaking, 
inverts the probability of selecting a plan, so that using a recursive method many 
times in an observed plan is more likely than using the same method only a few 
times. Then the baseline approach is attempting to learn a distribution skewed to- 
wards greater use of recursion overall, and in particular, a distribution that prefers 
more recursion to less recursion all else being equal. However, there is no pHTN 
that prefers more recursion to less recursion all else being equal; fewer uses of a 
recursive method always increases the probability of a plan. So the baseline will 
fit an inappropriately large probability to any recursive method, but, it will still 
make the correct decision between two plans differing only in the depth of their 
recursion over that method. Naive Bayes Classifiers exhibit a similar effect |@, 
Box 17.A]. 
5. Discussion and Related Work



In the planning community, HTN planning has for a long time been given two
distinct and sometimes conflicting interpretations: it can be interpreted
either in terms of domain abstraction^11 or in terms of expressing complex (not
first-order Markov) constraints on plans.^12 The original HTN planners were
motivated by the former view (improving efficiency via abstraction). In this view,
only top-down HTN planning makes sense as the HTN is supposed to express effective
search control. Paradoxically, w.r.t. that motivation, the complexity of HTN
planning is substantially worse than planning with just primitive actions [15]. The
latter view explains the seeming paradox easily: finding a solution should be
easier, in general, than finding one that also satisfies additional complex
constraints. From this perspective both top-down and bottom-up approaches to HTN
planning are appropriate (the former if one is pessimistic concerning the
satisfiability of the complex constraints, and the latter if one is optimistic).
Indeed, this perspective led to the development of bottom-up approaches [16].



Despite this dichotomy, most prior work on learning HTN models (e.g. [17, 18, 19, 20])
has focused only on the domain abstraction angle. Typical approaches
here require the structure of the reduction schemas to be given as input, and fo- 
cus on learning applicability conditions for the non-primitives. In contrast, our 
work focuses on learning HTNs as a way to capture user preferences, given only 
successful plan traces. The difference in focus also explains the difference in 
evaluation techniques. While most previous HTN learning efforts are evaluated in 
terms of how close the learned schemas and applicability conditions are, syntac- 
tically, to the actual model, we evaluate in terms of how close the distribution of 
plans generated by the learned model is to the distribution generated by the actual 
model. 

An intriguing question is whether pHTNs learned to capture user preferences 
can, in the long run, be over-loaded with domain semantics. In particular, it 
would be interesting to combine the two HTN learning strands by sending our 
learned pHTNs as input to the method applicability condition learners. Presum- 
ing the user's preferences are amenable, the applicability conditions thus learned 
might then allow efficient top-down interpretation (of course, the user's prefer- 
ences could, in light of the complexity results for HTN planning, be so antithetical 
to the nature of the domain that efficient top-down interpretation is impossible). 



^11 Non-primitives are seen as abstract actions, mediating access to the concrete actions.
^12 Non-primitives are seen as standing for complex preferences (or even physical constraints).
There are other representations for expressing user preferences,
such as trajectory constraints expressed in linear temporal logic. It will be
interesting to explore methods for learning preferences in those representations 
too, and to see to what extent typical user preferences are naturally expressible in 
(p)HTNs or such alternatives. 

6. Conclusion 

Despite significant interest in learning in the context of planning, most prior 
work focused only on learning domain physics or search control. In this paper, we 
expanded this scope by learning user preferences concerning plans. We developed 
a framework for learning probabilistic HTNs from a set of example plans, draw- 
ing from the literature on probabilistic grammar induction. Assuming the input 
distribution is in fact sampled from a pHTN, we demonstrated that the approach 
finds a pHTN generating a similar distribution. It is, however, a stretch to imagine 
that we can sample directly from such a distribution — chiefly because observed 
behavior arises from a complex interaction between preferences and physics. 

We demonstrate a technique overcoming the effect of such feasibility con- 
straints, by reasoning about the available alternatives to the observed user be- 
havior. The technique is to rescale the distribution to fit the assumptions of the 
baseline pHTN learner. We evaluate the approach, and demonstrate both that the 
original learner is easily confounded by constraints placed upon the preference 
distribution, and that rescaling is effective at reversing this effect. We discuss 
several remaining important directions for future work to address. Of these, the 
most directly relevant technical pursuit is learning parameterized pHTNs, or more 
generally, learning conditional preferences. Fully integrating an automated plan- 
ner with the learner, thereby using the learned knowledge, and running (costly...) 
user studies are also very important pursuits. In the end, we describe an effective 
approach to automatically learning a model of a user's preferences from observa- 
tions of only their behavior. 